Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make schema in DBApiHook private #17423

Conversation

potiuk
Copy link
Member

@potiuk potiuk commented Aug 4, 2021

There was a change in #16521 that introduced schema field in
DBApiHook, but unfortunately using it in provider Hooks deriving
from DBApiHook is backwards incompatible for Airflow 2.1 and below.

This caused Postgres 2.1.0 release backwards incompatibility and
failures for Airflow 2.1.0.

Since the change is small and most of DBApi-derived hooks already
set the schema field on their own, the best approach is to
make the schema field private for the DBApiHook and make a change
in Postgres Hook to store the schema in the same way as all other
operators.

Fixes: #17422


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

@boring-cyborg boring-cyborg bot added area:core-operators Operators, Sensors and hooks within Core Airflow area:providers labels Aug 4, 2021
@potiuk potiuk requested a review from jedcunningham August 4, 2021 20:39
@@ -70,6 +70,7 @@ def __init__(self, *args, **kwargs) -> None:
super().__init__(*args, **kwargs)
self.connection: Optional[Connection] = kwargs.pop("connection", None)
self.conn: connection = None
self.schema: Optional[str] = kwargs.pop("schema", None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead, maybe this:

Suggested change
self.schema: Optional[str] = kwargs.pop("schema", None)
if not hasattr(self, "schema"):
self.schema: Optional[str] = kwargs.pop("schema", None)

That should work for core both pre and post 2.2.0 and wouldn't break behavior of get_uri either like doing __schema does.

Copy link
Member Author

@potiuk potiuk Aug 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that having schema as attribute in DBApiHook will encourage people to use it in their other Hooks deriving from DBApiHook. It will be rather difficult to make sure that everyone is using it in "hasattr" way. Even if we two remember it now, we will forget about it and other committers who are not involved will not even know about this. We could potentially automate a pre-commit check if the "DBApiHook's schema" is accessed with hasattr, but I think we will never be able to do it with 100% accuracy and I think it's simply not worth it for that single field, that can be simply replaced with single line for each operator that wants to use it (in init):

        self.schema: Optional[str] = kwargs.pop("schema", None)

I think we should think about DBApiHook as "Public" API of Airflow and any change to it should be very, very carefully considered.

Another option would be to add >=Airflow 2.2 limitation to Postgres operator (and any other operator that uses it), but again I think sacrificing backwards compatibility in this case is simply not worth it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah actually I see an error .in my solution.. I should not do "pop" in the args :)

Copy link
Member Author

@potiuk potiuk Aug 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed "pop" from DBApi to leave it for other hooks.

Actually I just realized the way it was implemented before - with pop() had undesired side effect for Hooks that already used kwargs.pop() .The side effect was that kwargs.pop() in DBApiHook would remove the schema from kwargs in derived classes and their self.schema = kwargs.pop("schema", None) would override the schema to None (!)

Example: mysql:

https://github.com/apache/airflow/blob/main/airflow/providers/mysql/hooks/mysql.py#L61

So in fact, the original change (not yet released luckily) in DBApiHook was even more disruptive than the failed PostgresHook. I am actually glad it was uncovered now, as it would be far more disruptive it was released in Airflow.

That's why we should be extremely careful with changing DBApiHook (and BaseHook and similar).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😱 🙀

Copy link
Member Author

@potiuk potiuk Aug 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is unlikely edge case (much less likely than use of schema when it will be defined in DBApiHook as public field). I hardly see the use for that. Why somone would like to define a DBApi -derived hook and pass a 'schema' parameter to it while also changing it in it's own __init__? Seems extremely unlikely, also It feels natural that in such case the change should be done before super.__init__() rather than after.

We can assume that "schema" is our convention for kwarg para that we should use for all DB Hooks. We can standardise it (it's already commonly used) and release in Airflow 3 DBApiHook I think.

For now we might want to add some more docs/comments explaining it and some Airflow 3 notice about it. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I should have included an example. This is what I meant:

h = SomeHook(schema="foo")
h.get_uri() # uses foo
h.schema = "bar"
h.get_uri() # still uses foo because of __schema

That's why I'm thinking letting both DbApiHook and any derived hooks both set schema might be the best of both worlds here? Then the derived hooks could stop setting schema in Airflow 3 (or the next time they get a min core version bump really).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right - I think i fiigured out better way of doing it. I've added the 'schema' argument explicitly as optional kwarg to DBApiHook and made a comment about the "change schema value" in the description of the parameter.

I think it's much better - it makes schema an explicit part of the DBHookAPI hook's API, it is fully backwards compatible and it makes it very easy for Airflow 3 change - we will simply make the self.schema public and convert all the operators that use it in the "old" way. See the latest fixup.

Copy link
Member Author

@potiuk potiuk Aug 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why I'm thinking letting both DbApiHook and any derived hooks both set schema might be the best of both worlds here? Then the derived hooks could stop setting schema in Airflow 3 (or the next time they get a min core version bump really).

The problem with this is - that it has already happened that we overlooked that the DBApiHook.schema object has been accessed directly by a provider, which made it backwards incompatible with Airflow 2.1. We do not have any mechanism to prevents this is in the future again, if we have it as a public field in DBApiHook. This is a bit of a problem that DBAPiHook is "public API" part and by making a public field in this class, we change the API.

The way I proposed - in the last fixup, schema becomes part of the DBApiHook 'initialization' API. If any other hook stores the schema field as self.schema - so be it, it is "its own responsibiliity" - we make it clear now in the constructor of the DBApiHook that passing "schema" as kwargs is THE way how to override the schema, and by not having a public field, we clearly say that the deriving hook cannot expect that it can change it and expect "DBApiHook" getUri() method will use it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any other comments? I think We will not release an out-of-band Postgres operator, so this is something we will have to solve mid-August, but would be good to get some opinions :)

@potiuk potiuk force-pushed the fix-backwards-compatiblity-problem-with-dbapi-schema branch 2 times, most recently from 65afb15 to 844f421 Compare August 5, 2021 08:57
@potiuk
Copy link
Member Author

potiuk commented Aug 5, 2021

Fixed all the problems - looking for other's opinion of what will be the best solution to DBApiHook modifications vs. Providers.

@potiuk potiuk force-pushed the fix-backwards-compatiblity-problem-with-dbapi-schema branch from 844f421 to d5e4dd7 Compare August 5, 2021 20:17
There was a change in apache#16521 that introduced schema field in
DBApiHook, but unfortunately using it in provider Hooks deriving
from DBApiHook is backwards incompatible for Airflow 2.1 and below.

This caused Postgres 2.1.0 release backwards incompatibility and
failures for Airflow 2.1.0.

Since the change is small and most of DBApi-derived hooks already
set the schema field on their own, the best approach is to
make the schema field private for the DBApiHook and make a change
in Postgres Hook to store the schema in the same way as all other
operators.

Fixes: apache#17422
@potiuk potiuk force-pushed the fix-backwards-compatiblity-problem-with-dbapi-schema branch from d5e4dd7 to b9801ec Compare August 7, 2021 15:03
@potiuk potiuk requested review from ashb, kaxil, uranusjr and eladkal August 7, 2021 15:23
@potiuk potiuk added this to the Airflow 2.2 milestone Aug 7, 2021
@github-actions
Copy link

github-actions bot commented Aug 7, 2021

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Aug 7, 2021
@potiuk potiuk merged commit 04b6559 into apache:main Aug 7, 2021
@potiuk potiuk deleted the fix-backwards-compatiblity-problem-with-dbapi-schema branch August 7, 2021 17:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:core-operators Operators, Sensors and hooks within Core Airflow area:providers full tests needed We need to run full set of tests for this PR to merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AttributeError: 'PostgresHook' object has no attribute 'schema'
3 participants